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1  Introduction 


The  class  of  Jacobi  rotation  methods  [4,7,10,12]  for  computing  the  sym¬ 
metric  eigenvalue  decomposition 

A  =  UDUt  (1) 


of  an  n  x  n  real  matrix  A ,  where  U  is  orthogonal  and  D  is  diagonal,  has 
generated  substantial  interest  in  recent  years,  particularly  in  the  context 
of  parallel  computer  architectures  Algorithms  have  been  developed  for 
systolic  processor  arrays  as  well  as  for  more  general  purpose  parallel  com¬ 
puters.  These  methods  differ  principally  from  the  original  method  of  Jacobi 
in  that  they  choose  a  fixed  sequence  of  matrix  elements  for  the  necessary 
orthogonal  rotations.  Jacobi’s  method  performs  a  rotation  to  zero  out  the 
largest  off-diagonal  element  at  each  step;  the  sequence  of  rotations  is  data- 
dependent. 

This  paper  presents  a  novel  block  matrix  or  “hypermatrix”  adapta¬ 
tion  [2,3,16]  of  the  original  algorithm,  which  we  label  the  Blocked  Classical 
Jacobi  (BCJ)  algorithm.  The  matrix  A  is  treated  as  a  smaller  m  x  m  matrix 
of  6  x  6  submatrices;  computations  work  on  entire  submatrices  rather  than 
on  scalars.  Furthermore,  the  sequence  of  submatrices  to  be  rotated  is  cho¬ 
sen  to  locally  maximize  the  reduction  of  A  to  diagonal  form  by  selecting  the 
off-diagonal  blocks  of  largest  mass.  BCJ  reduces  serial  runtimes  compared 
with  other  blocked  Jacobi  methods  on  a  particular  class  of  inputs. 

For  computers  with  a  hierarchical  memory  system,  in  which  succes¬ 
sively  larger  yet  slower  memories  are  located  at  increasing  distances  from 
the  arithmetic  processor,  many  numerical  calculations  are  efficiently  struc¬ 
tured  in  terms  of  block  algorithms  [3,8,15].  Rather  than  computing  with 
scalar  quantities,  block  algorithms  operate  on  small  square  or  rectangular 
submatrices  of  data.  The  resulting  “surface-to-volume”  effect  of  a  single 
block  data  transfer  followed  by  several  computations  allows  a  fast  processor 

with  local  memory  to  achieve  nearly  full  utilization  even  when  supplied  by  ~T - 

a  significantly  slower  bus  or  main  memory.  - 

The  blocked  organization  of  BCJ  reduces  the  overhead  cost  of  deter- 
mining  the  maximum  off-diagonal  elements.  It  also  makes  BCJ  especially  ccj 
well-suited  for  implementation  on  multiprocessors  with  a  hierarchical  mem-  i 
ory  system  ( e.g [8]).  As  well,  BCJ  is  suitable  for  parallel  implementation. 
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However,  the  significantly  shorter  runtimes  of  the  sequential  Jacobi  suggest 
that  it  is  to  be  preferred  over  all  blocked  methods  on  this  class  of  inputs. 

The  organization  of  the  paper  is  as  follows.  Section  2  gives  a  brief  review 
of  serial  Jacobi  methods  for  the  symmetric  eigenproblem.  Section  3  gives 
the  motivations  for  BCJ  and  presents  the  algorithm  as  implemented  in  this 
study.  Section  4  lays  out  the  numerical  experiments  with  BCJ,  including 
timings  and  numbers  of  iterations  to  convergence.  Section  5  presents  the 
analysis  of  the  block  selection  method  using  the  theory  of  order  statistics, 
and  discusses  the  implications  for  the  experimental  data.  Concluding  re¬ 
marks  and  indications  for  parallel  implementations  are  presented  in  section 
6.  Section  7  contains  the  proofs  of  two  probabilistic  results  from  section  5. 


2  Review  of  Serial  Jacobi  Methods 


The  Jacobi  method  of  solving  (1)  constructs  a  sequence  of  orthogonal  ro¬ 
tations  TJ\  =  Ui  —  U(02,i2,j2,Aw),. . . ,  such  that  U  = 

UiU2---  diagonalizes  A  (that  is,  UT AU  is  diagonal),  0  <  0,  <  7r/4,  and 
lim,_oo  0,  =  0.  In  practice  the  computation  is  terminated  after  a  finite 
number  of  rotations,  leaving  U  —  UiU2  . . . ,  Un-  The  rotation  t/„  is  selected 
to  zero  out  the  matrix  elements  in  positions  (iu,jv)  and 

Given  (i,  j)  =  jV),  the  rotation  angle  9„  is  computed  so  that  A^  = 
Uj A^u~^Uv,  according  to 


with  a\^  =  =  0;  here  =  A.  The  cosine  cv  and  sine  su  of  the  angle 

du  may  be  calculated  by  [9] 


r  =  (a!r"  -  aSr,,)/(2o!;-'>),  „!;-■>  +  0, 

then  solving  for  t  in 

sign(r) 


C+2tr  =  l  I  t  = 


|r|  +  y/i  +  r2 


and  substituting  in 


:„  =  (!-) -t2)  l/2,  sv  =  cj. 


(3) 

(4) 

(5) 
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U„  is  set  to  the  identity  matrix,  except  in  rows  and  columns  iv  and  jv, 
where  it  is  zero  everywhere  but  in  the  2x2  principal  submatrix;  there  it  is 

f  Cv  j.  If  =  0  then  c„  is  set  to  1  and  s„  to  0,  for  8„  is  obviously 

\  Cu  J 

0. 


There  are  several  methods  for  choosing  the  rotation  index  pair  (i,j)- 
The  classical  Jacobi  method  selects  (i,j)  at  each  step  to  locally  minimize 
the  resulting  off-diagonal  Frobenius  norm  by  choosing  (i,  j)  as  the  location 
of  the  largest  off-diagonal  element.  However,  the  effort  of  determining  the 
location  of  the  maximum  element  (0(n2)  operations)  exceeds  the  work  in 
calculating  and  applying  the  orthogonal  rotation  U„  (approximately  18n 
operations  neglecting  symmetry).  For  this  reason  the  method  is  rarely 
used  on  computers. 

The  cyclic-by-rows  ordering  of  elements  (( i,j  )  =  (1,2),  (1, 3),  . . . ,  (1,  n), 
(2,3),  ...,  (2 ,n),  ...,  (n  —  l,n))  is  more  amenable  to  automatic  compu¬ 
tation.  However,  the  successive  index  pairs  axe  almost  always  dependent 
(sharing  a  row  or  column),  and  thus  not  suited  for  parallel  computation. 
Parallel  orderings  have  featured  other  index  pair  selections  chosen  for  data 
locality  and  utility  on  a  systolic  processor.  The  Brent-Luk  and  Sameh  or¬ 
derings  [4,13]  have  many  desirable  features.  They  preserve  data  locality  and 
are  amenable  to  systolic  or  other  parallel  implementations,  they  converge 
faster  than  the  cyclic-by-rows  ordering,  and  they  rotate  each  off-diagonal 
element  exactly  once  in  a  “sweep.”  A  particularly  useful  feature  is  that 
at  each  step,  the  n/2  independent  rotations  (operating  on  n/2  mutually 
distinct  pairs  of  rows  and  columns)  may  be  carried  out  simultaneously. 


3  Algorithm  BCJ 

We  now  develop  a  blocked  analogue  of  the  classical  Jacobi  algorithm  for 
the  symmetric  eigenproblem  (1)  that  performs  more  work  in  selecting  the 
index  pairs  yet  requires  less  run-time  than  a  blocked  Brent-Luk  ordering. 
BCJ  also  generalizes  to  computation  of  the  singular  value  decomposition  of 
a  rectangular  matrix.  The  new  Blocked  Classical  Jacobi  (BCJ)  method 
selects  the  largest  off-diagonal  block(s)  for  rotation,  in  order  to  locally 
minimize  the  off-diagonal  mass  of  A.  Through  a  suitable  choice  of  the 
block  size,  the  extra  computations  to  determine  the  off-diagonal  block  of 
maximum  mass  are  offset  by  a  reduced  number  of  iterations;  BCJ  is  more 
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efficient  on  a  serial  computer  than  the  blocked  Brent-Luk  Jacobi  method. 

BCJ  is  also  highly  parallel  in  nature.  Where  several  processors  are 
available  to  solve  a  single  eigenproblem,  the  K  >  1  largest  independent 
off-diagonal  blocks  may  be  selected  for  simultaneous  rotations,  leading  to 
a  straightforward  parallel  implementation. 

At  each  iteration,  BCJ  selects  an  off-diagonal  block  submatrix  (i,j) 
for  rotation  and  computes  a  block  orthogonal  rotation  matrix  Uv,  which  it 
then  applies  to  help  reduce  A  to  block  diagonal  form.  The  block  orthogonal 
rotation  can  be  chosen  as  a  sequence  of  scalar  Jacobi  rotations  or  from  the 
eigendecomposition  of  the  small  block  matrix;  we  use  a  full  scalar  Sameh 
sweep  on  the  small  block  matrix.  (However,  there  is  no  restriction  that  the 
small  block  matrix  must  be  diagonalized,  only  that  its  off-diagonal  mass  be 
reduced.  Computations  by  Bischof  [1]  on  the  SVD  indicate  that  the  extra 
effort  of  completely  diagonalizing  the  block  matrix  at  each  step  may  be 
wasted.)  The  method  then  proceeds  by  selecting  another  block  element  of 
A  to  rotate.  A  final  processing  step  of  Sameh  sweeps  forms  the  eigenvalues 
and  eigenvectors  from  the  block  diagonal  elements  of  A.  On  an  mxm  block 
matrix,  a  BCJ  “sweep”  is  (m2  —  m)/2  two-block  by  two-block  rotations. 

The  precise  block  algorithm  for  carrying  out  BCJ  to  compute  the  sym¬ 
metric  eigendecomposition  (1),  with  D  overwriting  A,  is  as  follows.  Assume, 
for  ease  of  exposition,  that  the  block  size  b  divides  the  matrix  size  n  exactly, 
so  that  n  =  mb.  K  >  1  independent  off-diagonal  blocks  may  be  selected  for 
simultaneous  rotation.  The  iterations  continue  until  a  tolerance  criterion 
TOL  is  met.  The  method  begins  with  U  =  I,  v  —  0,  and  continues 

1.  Compute  the  squared  masses  {M.-j} ■nJ=1,  with 

Mij  =  E  (a(t-l)6+r,(j-l)fc+«)  • 

rf5=l 

2.  Select  K  independent  rotation  pairs  {ik,jk),\  <  k  <  I\  with 

Mikjk  =  max(MtJ\i  $  {ti,...  ,ik-i},j  £  {j\ . jfc-i})- 

3.  Compute  K  block  rotations  { U{6k ,  ik,jk,  to  reduce  the  block 

off-diagonal  mass  of  A  (as  indicated  below). 
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4.  Apply  the  block  rotations  of  step  3  to  U  and  to  A^u\  forming 

5.  If  the  block  off-diagonal  mass  of  AS"*1)  is  not  less  than  TOL  times  the 
block  diagonal  mass,  then  set  v  :=  v  +  1  and  go  to  step  1. 

6.  Diagonalize  the  diagonal  blocks  of  AS")  (until  the  off-diagonal  mass  is 
less  than  TOL  times  the  diagonal  mass)  and  update  U . 

Step  3  of  our  BCJ  implementation  uses  a  single  scalar  Sameh  sweep  to 
reduce  the  off-diagonal  mass  of  the  two-block  by  two-block  submatrix.  This 
sweep  includes  6(26  —  1)  pointwise  rotations  performed  sequentially.  Step 
6  uses  successive  Sameh  sweeps  to  diagonalize  the  block  diagonals  of  ASuK 

The  BCJ  algorithm  is  to  be  compared  against  the  “block  Brent-Luk” 
algorithm,  which  omits  step  1  and  replaces  step  2  by  selecting  m/2  block 
index  pairs  according  to  the  Brent-Luk  ordering.  A  block  Brent-Luk  sweep 
also  involves  (m2  —  m)j 2  two-block  by  two- block  rotations.  It  is  impor¬ 
tant  to  note  that  the  two  methods  under  comparison  differ  only  in  their 
index  selection  methods.  In  figure  5  we  also  display  timings  for  the  Sameh 
ordering  of  the  scalar  Jacobi  and  the  Eispack  (TRED2/TQL2)  methods. 

4  Experimental  Results 

Several  numerical  experiments  were  conducted  to  compare  the  efficiency  of 
BCJ  and  blocked  Brent-Luk  symmetric  eigensolvers  on  matrices  of  random 
data.  The  test  matrices  were  generated  as  matrices  of  uniform  random 
deviates  from  (0,1];  in  each  case  10  tests  were  run  to  give  non-parametric 
error  bounds  to  within  10%.  All  computations  were  carried  out  with  a 
tolerance  of  TOL  =  10-8.  Table  1  summarizes  the  run  times  of  the  two 
methods  on  problems  with  various  values  of  n,  m,  6,  and  k.  Comparable 
average  Eispack  times  from  TRED2/TQL2  are  1.46,  0.26,  and  0.05  seconds 
for  n  =  64, 32, 16,  respectively.  Comparable  average  times  for  the  Sameh 
ordering  of  scalar  Jacobi  sweeps  are  6.34,  0.79,  and  0.11  seconds,  for  n  = 
64, 32, 16,  respectively. 

Figures  1-4  display  iteration  counts  and  relative  efficiencies  of  the  two 
algorithms.  Figure  6  shows  the  average  runtimes  for  the  four  methods  with 
their  optimal  6  values,  for  n  =  16,32,64,128.  Table  2  shows  the  number 
of  scalar  rotations  required  by  the  Jacobi,  BCJ,  and  BBL  methods  for  the 
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BCJ  execution  times  I  Blocked  Brent- 


BCJ 

MIN 


26.04 
22.11 
8  31.48 
16  58.36 
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BCJ 

AVG 


26.93 

25.48 

39.97 

107.90 


4.09 

5.20 

13.50 


.72 


BCJ 

MAX 


28.68 

29.98 

66.47 

253.24 


4.71 

7.66 

23.14 


.83 


B-L 

MIN 


44.96 

30.02 

38.19 


6  4  I  0.52  1.40  7.11 


B-L 

AVG 


79.39 

63.13 

145.70 


97.38  242.68 


3.70  8.04 

5.34  9.14 

9.41  16.55 


.45 

2 

.62 

1 

Luk  times 


B-L 

MAX 


136.12 

146.09 

424.32 

513.49 


15.39 

19.00 

31.36 


4.23 

2.36 


Table  1:  Multiflow  Trace/7  (compiler  version  1.5.3)  execution  times  (sec.) 
for  BCJ,  blocked  Brent-Luk,  TQL  =  10"8,  10  trials. 


N 

B 

Jacobi 

AVG 

BBL 

AVG 

BCJ 

AVG 

16 

2(4) 

640 

1381 

5668 

32 

2 

3072 

4128 

11242 

64 

4 

14029 

26691 

91422 

128 

8 

62669 

147510 

Table  2:  Multiflow  Trace/7  (compiler  version  1.5.4):  Average  number  of 
scalar  rotations  for  Jacobi,  BCJ,  and  BBL,  T0L  =  10-8,  10  trials. 


same  problems  as  in  Figure  6.  It  is  clear  that  the  Eispack  and  scalar  Jacobi 
methods  yield  superior  speeds  and  that  the  execution  time  is  dependent 
on  the  number  of  scalar  rotations.  Data  for  figures  1-4  and  table  1  were 
computed  using  version  1.5.3  of  the  Multiflow  Trace  compiler;  table  2  and 
figure  6  rely  upon  version  1.5.4. 

These  experiments  show  that  the  extra  work  of  finding  the  largest  in¬ 
dependent  off-diagonal  blocks  is  offset  by  faster  algorithmic  convergence  of 
BCJ,  which  makes  the  present  method  competitive  with  the  blocked  Brent- 
Luk  technique.  However,  the  block  rotations  require  more  scalar  work  than 
the  scalar  method;  a  block  rotation  seems  to  contribute  less  toward  diago¬ 
nalizing  the  matrix  than  an  equal  number  of  scalar  rotations  performed  in 
the  Sameh  order. 


5  Algorithmic  Analysis 

An  important  factor  in  determining  the  efficiency  of  the  algorithm  is  the 
block  size.  BCJ  has  the  following  computational  work  per  iteration  (n  = 
mb): 

Step  Computations 

1  2  n2 

2  0(m2  log  m) 

3  6A'(26)(26  —  l)/2 

4  \8nK(2b)(2b  —  l)/2 

5  2n2 

6  2n2b  +  0{nb2) 

The  work  for  step  1  is  actually  completed  in  step  5,  where  the  block 
masses  axe  computed,  so  that  after  the  first  iteration  step  1  contributes  no 
work.  Step  2  can  be  done  in  0(Km 2)  operations,  which  is  an  improvement 
if  I\  =  o(logm).  Step  6  is  performed  once  at  the  end  of  the  calculation 
and  has  asymptotically  negligible  work  if  b  =  o(n);  for  very  large  b  step  6 
dominates  the  total  work. 

A  moderate  value  of  b  should  be  preferable  in  order  to  maximize  the 
sum  of  off-diagonal  block  masses.  Indeed,  Figure  1  reflects  this  behavior. 
For  small  6,  the  overhead  of  determining  the  largest  block  exceeds  the  work 
of  diagonalizing  A.  As  6  increases,  the  maximal  off-diagonal  squared  block 
mass  will  approach  the  average  block  mass,  reducing  the  effect  of  each 
block  rotation,  and  consequently  lengthening  BCJ  computations.  Figure  3 
shows  that  for  several  matrix  sizes  n,  increasing  the  block  size  b  increases 
monotonically  the  number  of  sweeps  of  BCJ,  as  expected.  Furthermore, 
the  example  of  Figure  5  indicates  that  with  relatively  few  blocks,  two  large 
off-diagonal  masses  are  likely  to  be  dependent. 

We  now  examine  BCJ’s  behavior  with  a  brief  review  of  relevant  order 
statistics  theory  [6,11],  which  describes  the  behavior  of  sorted  random  vari¬ 
ates  in  terms  of  the  probability  distributions  of  the  individual  elements. 
Given  independent  and  identically  distributed  random  variables  Ai,  A'2, 
. . . ,  A/y,  the  N  order  statistics  A'j*^,  A'j  jV,  . . . ,  Ay  iV  are  the  random  vari¬ 
ables  associated  with  the  lowest  ranked  to  highest  ranked  A’,. 

A  particular  instance  of  the  theory  is  instructive  with  regard  to  BCJ, 
which  starts  with  the  (n2  —  n)/2-sized  upper  triangular  array  from  an  n  x  n 
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symmetric  matrix  .4*°*  of  uniform  random  variates.  Let  be  one  of  (n2  — 
n)/2  lid  uniform  variates  on  the  interval  (0, 1],  and  set  X,  =  Y2.  Then  an 
average  off-diagonal  element  of  A*0*  has  squared  mass  E[Xx]  =  1/3,  while 
the  maximum  has  mean  squared  mass  E[X(n2_ny2 ,(n2_n)/2]  =  1  —  2/(n2  — 
n  4- 2).  Selecting  the  maximum  off-diagonal  element,  rather  than  an  average 
element,  increases  the  reduction  to  diagonal  form  of  an  individual  rotation. 

Assuming  further  that  each  JY,,  1  <  i  <  (m2  —  m)/2,  is  distributed  as 
the  sum  of  b 2  squares  of  uniform  random  values  from  the  interval  (0,  1], 
as  is  Mij  in  the  first  step  of  our  blocked  experiments,  it  is  clear  that  the 
central  limit  theorem  applies  to  the  block  mass  distributions.  For  large  b, 
one  may  represent  the  off-diagonal  squared  block  mass  as  a  normal  random 
variable  with  mean  fj.  =  b2 / 3  and  variance  a2  =  4f>2/45  (corresponding  to 
the  sum  of  62  uniform  random  variables). 

The  maximal  order  statistic  for  these  large  blocks  tends  toward  a  stan¬ 
dard  limiting  distribution,  from  which  we  may  determine  the  moments. 
Although  the  example  employs  sums  of  uniform  variates,  the  proposition 
holds  for  any  blocks  that  have  asymptotically  normal  squared  mass. 

Proposition  1.  Let  {„Y,-}j™I-m^2  be  iid  normal  variates  with  mean  /x  and 
variance  cr2.  In  the  limit  as  m  —*  oo,  the  expected  largest  variate  is 


=  f*  +  <r\/21og((m2  -  m)/2) 


O  (tag  tag  m/ yjlo g  rnj 


and  the  variance  is 


Varf  V*  2  HI) -HI) 

lA(mJ-m)/2.(mJ-m)/2J  °  n  ]nr(  (  m*  -  m  \ 


2 log ((m2  -  m)/2) 


+  O  (tag  tag  mj (tag  m)2^  , 

where  r'(- )  and  r"(-)  are  the  first  two  derivatives  of  the  gamma  function, 
respectively.  (Note  that  r"(l)  —  r'(l)2  ~  1.64./ 

Thus  the  expected  largest  squared  mass  is  about  2  \/log  m  standard  de¬ 
viations  above  the  mean,  with  asymptoticlly  vanishing  variance.  Cohen  [5] 
has  derived  similar  results  for  generalized  matrix  products.  The  proof  of 
Proposition  1  is  left  to  section  7. 
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BCJ  operates  by  maximizing  the  mass  of  the  selected  off-diagonal  blocks. 
This  works  well  when  the  ratio  (n  -f  2o,\/logm)//i  is  large  while  the  addi¬ 
tional  cost  to  determine  the  largest  block  is  low.  Both  cost  and  benefit 
decrease  with  increasing  blocksize. 

For  certain  values  of  b.  BCJ  inherits  the  fast  convergence  of  the  classical 
Jacobi  method  without  paying  a  large  cost  for  maximal  selection.  If  b  is 
chosen  approximately  b  =  (log  n)lF  and  K  =  m/2,  the  work  of  computing 
and  selecting  the  independent  maximal  blocks  is  0  (n?{logn)1,/3^  per  iter¬ 
ation,  as  is  the  rotation  work,  so  that  the  two  are  of  comparable  sizes.  For 
larger  block  sizes  b ,  the  block  selection  cost  is  asymptotically  negligible.  If  b 
grows  as  i/log  n ,  then  the  largest  squared  block  mass  is  a  constant  multiple 
of  the  average  squared  block  mass,  while  the  extra  cost  of  determining  the 
maximal  off-diagonal  blocks  is  of  smaller  order. 

Figure  1  clearly  indicates  the  benefits  of  choosing  a  moderate  blocksize, 
as  the  average  solution  time  initially  decreases  as  b  grows.  However,  the 
use  of  a  large  b  produces  longer  runtimes. 

The  selection  of  I\  >  1  maximal  independent  off-diagonal  blocks  (step  2) 
forms  a  more  complex  sum  of  conditional  order  statistics,  which  we  now 
examine.  Let  X 1  <  i  <  M\,  be  iid  random  variates  with  density  f(x) 
and  distribution  F(x).  Denote  by  X ^  the  maximal  order  statistic.  Now 
fix  a  particular  subset  of  size  M2  of  the  remaining  variates  (excluding  the 
selected  maximum  and  others),  and  let  X be  the  maximal  variate  in 
the  subset.  It  is  clear  that  Xf^l<Mi  <  X^ .  Inductively  define  Xm+^m 

from  X^ . M  as  the  maximal  order  statistic  in  a  chosen  subset  of  Mk+i 

variates  selected  from  the  previous  subset  of  size  Mk  (excluding  the  previous 
maximum  and  others).  We  call  X$  %fk  the  klh  conditional  maximal  order 
statistic  of  the  {A;}^'r 

Proposition  2.  For  Mi  >  M2  >  ■  •  •  >  Mk  >  0,  the  probability  distribution 
of  the  kih  conditional  maximal  order  statistic  is 


Pr{-Y«. . «. 


(8) 


Letting  //,*/,  =  be  the  unconditional  mean  of  the  maximum  on 
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Figure  5:  Conditional  maximum  selection  -Y(1i4)  =  -Yj’5  =  10,  -Y(2,3)  = 

^15,6  =  5,^Y(5,6)  =  -Yl  5,6,1  =  4- 


Mi  observations,  we  have 


M, 


e^Xm' . ^  ^  m  n  ^  Mj  -  Mi , 


(9) 


We  briefly  indicate  the  formulation  of  the  first  step  of  BCJ  in  terms  of 
Proposition  2.  In  BCJ,  K  maximal  off-diagonal  blocks  are  selected  in  K 
stages  from  an  m  x  m  upper  triangular  array  of  (m2  —  m)/2  tid  random 
variates.  Independence  of  the  selected  locations  requires  striking  out  the 
row  and  column  of  the  maximum.  At  stage  i,  1  <  i  <  A',  the  maximal 
variate  will  be  drawn  from  a  subset  of  blocks  in  the  strict  upper 

triangle  of  the  array  and  then  two  rows  and  columns  of  the  array  will  be 
struck  out,  corresponding  to  the  row  and  column  indices  of  the  selected 
maximal  element. 

For  instance,  in  the  6  by  6  example  of  Figure  5,  the  first  maximum  is  10 
(row  1,  column  4).  Thereafter  rows  and  columns  1  and  4  are  struck  from 
the  array  (to  preserve  independence)  and  the  second  conditional  maximum 
is  selected;  it  is  5  (row  2,  column  3).  Note  that  larger  elements  that  are 
dependent  upon  the  first  maximum  may  be  ignored  in  the  selection  of  the 
second  maximum.  Finally,  rows  and  columns  2  and  3  are  struck  from  the 
array  and  the  final  maximum  of  4  (row  5,  column  6)  is  selected. 

The  selection  of  the  K  maximal  independent  off-diagonal  blocks  (which 
forms  the  more  complex  sum  of  conditional  order  statistics  discussed  above), 
determines  on  average  a  smaller  sum  of  off-diagonal  masses  than  K  suc¬ 
cessive  iterations  choosing  the  single  largest  block.  However,  it  is  observed 
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in  Figure  4  that  the  number  of  sweeps  to  convergence  initially  declines  as 
I\  increases.  This  probably  reflects  the  amortization  of  step  5  costs  over 
additional  blocks.  As  expected,  BCJ  requires  slightly  more  iterations  to 
converge  as  K  reaches  its  upper  limit  of  m/2  ( e.g b  =  2,4). 

Figure  2  presents  in  graphical  form  the  ratios  of  the  average  BCJ  and 
blocked  Brent-Luk  execution  times  on  a  Multiflow  Trace/7  computer.  The 
efficiency  ratio  shows  the  speedup  of  BCJ,  with  improvements  up  to  a  factor 
of  3.6  due  entirely  to  improved  index  selection.  Examination  of  Table  1 
shows  that,  for  almost  all  cases,  BCJ  runtimes  have  lower  deviations  from 
the  mean. 

Asymptotically,  b  =  Q  (jlogn)1^3'j  guarantees  that  the  work  of  selecting 
the  maximal  blocks  will  be  at  most  comparable  to  the  other  arithmetic 
operations.  However,  assuming  normality  of  initial  data  and  intermediate 
results,  the  optimal  6  so  that  largest  blocks  are  substantially  larger  than 
average  (S2  =  O^^/logm^),  from  eq.  (6))  is  b  =  ©(yflogn).  For  b  « 
(logn)",  1/3  <  a  <  1/2,  BCJ  should  be  asymptotically  faster  than  a 
blocked  Brent-Luk  method.  The  numerical  experiments  show  speedups  for 
problems  of  moderate  size. 

In  general,  the  distribution  of  the  elements  of  will  be  more  complex 
than  described  here  and  the  order  statistic  argument  must  be  specialized 
to  include  the  distributions  of  intermediate  results,  in  order  to  rigorously 
prove  rates  of  convergence.  However,  the  improved  performance  of  the  new 
algorithm  is  consonant  with  the  analysis  performed  here. 

In  cases  where  the  matrix  has  few  large  elements  or  is  close  to  diagonal, 
one  expects  BCJ  to  acheive  shorter  runtimes  than  indicated  by  these  ex¬ 
periments  on  uniform  random  data.  For  instance,  the  method  may  prove 
useful  in  adaptive  signal  processing  algorithms  that  rely  on  eigenvalue  de¬ 
compositions  [14] . 

6  Conclusions 

The  improved  index  selection  process  of  BCJ  produces  a  substantial  overall 
reduction  in  the  program  running  time,  compared  to  a  blocked  Brent-Luk 
algorithm.  In  particular,  the  extra  work  of  determining  the  largest  off- 
diagonal  blocks  is  offset  by  fewer  iterations  needed  for  convergence;  the 
overhead  is  negligible  for  large  problems.  Furthermore,  because  the  algo- 
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4  rithm  employs  blocked  data  concepts,  it  is  appropriate  for  computers  with 

a  hierarchical  memory  system.  The  concentration  of  work  on  the  relatively 
small  and  numerous  block  elements  is  advantageous  for  parallelization  of  the 
algorithm.  However,  it  seems  that  the  block  rotation  strategy  of  BCJ  and 
BBL  reduces  the  convergence  rate  of  both  methods,  leading  to  significantly 
longer  run-times.  A  block  rotation  contributes  less  toward  diagonalizing  the 
matrix  than  an  equal  number  of  scalar  rotations  performed  in  the  Sameh 
order. 

The  selection  of  parameters  b  and  K  is  important  to  the  efficiency  of 
BCJ.  A  moderate  value  of  6  gives  the  lowest  rim  times  (though  not  the 
lowest  number  of  block  sweeps).  The  extra  benefit  of  increasing  K  falls  off 
rapidly  for  small  n. 

Nearly  all  stages  of  the  algorithm  are  amenable  to  efficient  parallel  com¬ 
putation.  Step  1  can  be  computed  independently  on  m2  processors;  step  2 
on  various  combinations  of  processors  and  interconnections;  step  3  on  K 
,  large- grained  or  K b  fine-grained  processors,  depending  on  whether  the  block 

rotation  is  parallelized  or  not;  step  4  on  up  to  bKn  processors;  step  5  on 
m2  or  more  processors;  and  step  6  on  K  or  more  processors. 

*  This  investigation  of  BCJ  was  prompted  by  the  use  of  a  blocked  Brent- 
Luk  method  in  the  Saxpy  Computer  Corp.  mathematical  subroutine  li¬ 
brary.  It  appears  that  a  BCJ  method  could  be  more  efficient  than  the 
BBL  approach  chosen  there.  It  is  possible  that  BCJ  would  show  improved 
performance  on  more  nearly  diagonal  or  sparse  matrix  problems. 

7  Calculation  of  Distributions  of  the  Max¬ 
imal  and  kth  Conditional  Maximal  Order 
Statistics 

Proof  of  Proposition  1.  David  [6]  presents  the  limiting  distributional  behav¬ 
ior  of  the  maximal  order  statistic  X“  n,  which  depends  upon  the  well-known 
distribution 

A(x)  =  exp(  —  e~x)  —  oo  <  x  <  oo  (10) 

in  the  case  of  iid  0-1  normal  variates.  We  now  carry  through  the  analysis 
for  general  iid  normal  variates. 

Let  <f>ua 2(1)  =  (<7\/27r)-1  exp(— (x  —  y.)2 /2cr2)  be  the  normal  density  func- 

*  tion  and  let  $(1£r?(x)  =  /f^  <ftiai{y)dy  be  the  normal  distribution  function 
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V 


corresponding  to  mean  fj.  and  variance  a2,  respectively.  For  large  x, 


1  -  $Mg2(x)  _  <72 


fiuo2  (■*•) 


X  —  [X 


(11) 


based  on  a  change  of  variables  from  the  case  /i  =  0  and  (7  =  1.  Thus 
Theorem  9.3.5  [6]  applies  and 


Jim Pr  {(■*»,«  ~  Z»)n4vj(Z«)  <  *}  =  A(I) 


(12) 


holds  uniformly  for  every  x  E  (— oo,  oo),  where  ln  is  selected  so  that  ^^(/n)  = 
1  -  1  jn. 

According  to  (11), 


-  =  1  - — 

n  ln  —  fi  (T\/2t 


2  i 

exp 


2(72 


The  asymptotic  form  of  /„  is  then 

W-Hre  +  0(1/ log  »). 

Using  the  relation  m£w<72(Zn)  =  (/„  —  /z)/<72  +  0(l~l),  we  see  that 
pr  {(X*  -  /„)(/„  -  m)/<72  <  *}  =  A(x). 

It  follows  directly  from  (15)  that 

pryi  _  ,  ,  HIV  r,_,, 

=  n  +  <7^/2  log  n  +  O  ^loglogn/0ognj 


(13) 

(14) 

(15) 

(16) 
(17) 


where  r'(l)  (Euler’s  constant)  is  the  mean  associated  with  A(x).  The  vari¬ 
ance  vanishes  asymptotically  according  to 


(18) 


=  (7 2 


rw(i)  -r(i)2 


2  log 


- - +  o  (log  log  n/(log  n)2)  ,  ( 19) 
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where  r"(l)  —  r'(l)2  is  the  variance  associated  with  A(x).  (Here  we  have 
used  r'(-)  and  r"(-)  to  represent  the  first  two  derivatives  of  the  gamma 
function,  respectively.)  □ 

Proof  of  Proposition  2.  Let  =  x  |  Xi  <  y|  denote  the  event  that 

the  maximal  order  statistic  on  A/*  observations  is  x,  conditioned  on  all 
observations  X,  in  the  subset  of  size  Mk  being  bounded  above  by  y.  Then 
the  density  of  the  kth  conditional  maximal  order  statistic  obeys  the  relation 

p-T*;?, . «.  =  *}  (20) 


=  J H  Pr  {XJk'j,,.,  =  !/}  Pr  {X-MkM,  =  X  I  X<  <  y}  dy, 


where 


Pr{XM,.M.=*l X,  <  y}  =  l  f  •  -oo<x<y<no  (21) 

k  1  (.0,  —  oo  <  y  <  x  <  oo 

is  the  probability  distribution  of  the  maximal  order  statistic  on  M k  bounded 
observations. 

Define  P<.(x)  =  Pr  {Xfa  Mk  <  xj.  Then  Vx(x)  —  F(x)Ml .  Inductively 
assuming  that  Vk(x)  =  ai,kF(x)M'  gives  a  recurrence  relation  on  the 
atk  of 

Qik  —  Qik-i~ — ~T7'  l<i<k,  (22) 


Mk-Mi ' 


and 


M{ 


O kk  —  /  ,  Qik— 1  >  r  ./  i 

M{  -  Mk 


(23) 


where  an  =  1.  Consideration  of  the  k  —  1  degree  Lagrangian  polynomial 
interpolating  the  points  (M;,l),  1  <  i  <  k,  establishes  that 


-  rr  (  M>  ' 

n  U  -  Mi, 

3*i 


(24) 


The  distribution  is  thus 


Pr  ^ . * s  d  =  g  n  °  (25) 


i 
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Figure  1.  Average  BCJ  solution  time  for  N=63.  T0L  =  100-8 
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Figure  5  appears  on  page  11. 
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